options(knitr.table.format = "html")
knitr::opts_chunk$set(echo = TRUE, message = FALSE)
library(ggplot2)
library(tidyverse)
library(Biobase)
library(lavaan)
library(HIMA)
library(bama)
library(LUCIDus)
library(reshape2)
library(networkD3)
library(sm)
library(plotly)
library(janitor)
library(dplyr)
# read in data
work.dir = "/cloud/project/Labs/Mediation"
dat = readRDS(file.path(work.dir, "HELIX_data.rds"))
#loading plotting function for LUCID (highly customizable functions for complex LUCID model that are not built in the LUCIDus package)
source(file.path(work.dir, "/plotting_func/plot_lucid_in_parallel.R"))
source(file.path(work.dir, "plotting_func/plot_lucid_in_serial.R"))
source(file.path(work.dir, "plotting_func/plot_omics_profiles_general.R"))
LUCID allows user to incorporate and analyze multiple omics layers at the same time. LUCID is an integrative analysis, which jointly estimates latent clusters based on multiple omics layers and associates latent clusters with outcome of interest. In the lecture, we have covered the theoretical part of the LUCID model. We will go through the details of its application in a real world study.
In 2002, physician Paula Baillie-Hamilton conducted a study revealing that the obesity epidemic correlated with the increased production of synthetic chemicals. Among them, organochlorines (OCs) constitute a major class of synthetic chemicals with suspected obesogenic properties. The effects of OC are biomagnified through the food chain, and especially affect the health outcome of pregnant woman and children.
OCs include synthetic chemicals that were widely used as pesticides and in industrial processes throughout most of the 20th century. The use of these chemicals are discontinued in the United States and Europe, however, because of their persistence in the environment, the general population is still exposed to these substances at low doses, and adverse health outcomes related to background population levels of exposure are a major concern. OCs are ubiquitous and persist in the environment, accumulate in high concentrations in fatty tissues, and are biomagnified through the food chain.
Pregnancy and childhood constitute periods of high vulnerability to chemical toxic effects, as it is when rapid tissue development occurs while there is incomplete development or function of protective mechanisms, such as xenobiotic metabolism and immune function
Research Question: 1. Explore the association between exposure to OCs and BMI (measured as a standardized BMI z-score) 2. Estimate the latent clusters in children characterized by exposome (exposure to OCs) and multiple omics layers (proteomics, serum metabolites and urinary metabolites); construct exposure and omics profiles for children with high risk of obesity
We use the LUCID model to analyze this research question, which can
be expressed as the DAG below.
First, let’s prepare data for LUCID. The LUCID model takes
matrix, or matrix-like data (such as
data.frame) as input. If input is vector, it will be
transformed into a matrix automatically. Besides, since we
incorporate multiple layers of omics data which are measured at
different scales, it’s recommended to standardize multiple omics data to
make sure they are at the same scale.
#===================================#
## 1. prepare data for LUCID model ##
#===================================#
exposure = dat[, 2:19]
omics = scale(as.matrix(dat[, 27:ncol(dat)])) # scale the omics
outcome = dat[, "hs_zbmi_who"]
cova_names = c("h_mbmi_None", "e3_sex_None", "h_age_None", "h_cohort", "h_edumc_None")
M_names = colnames(omics)
expo_names = colnames(exposure)
covariate = model.matrix(~., dat[, cova_names])[, -1]
Next, we conduct a preliminary screening to decrease the number of
omics variables to a moderate number. Here we use the idea of pairwise
association between exposure and omics layer, and between omics layer
and the outcome. This is similar to the “meet in the middle” approach we
conducted in the previous lab.
### Balanced Omics Feature Selection Using Modified M-i-M (FDR =
0.1)
We applied a two-step modified Meet-in-the-Middle (M-i-M) procedure to identify omics features that are both associated with exposures and predictive of the outcome. To ensure equal representation across omics layers, which was encouraged for early integration of multi-omics data. we selected exactly 10 features per layer, prioritizing exposure-responsive features but allowing fallback to top outcome-associated features when necessary.
Step 1 (X → M | Y):
For each omics layer, we test the association between each exposure and
each omics feature, adjusting for the outcome and covariates. Features
significantly associated with at least one exposure (FDR < 0.1) are
flagged as exposure-responsive.
Step 2 (M → Y):
All features in each layer are tested for association with the outcome.
Features are ranked by FDR-adjusted p-values. The top 10 per layer are
selected, giving priority to exposure-responsive features but
supplemented with top Y-associated features if needed.
This approach ensures 30 final features (10 per layer), balancing statistical relevance and biological plausibility across omics types.
#===========================#
## Group omics by layer ##
#===========================#
omics <- as.data.frame(omics)
S_metab <- omics[, grep("^serum", names(omics), value = TRUE)]
U_metab <- omics[, grep("^urine", names(omics), value = TRUE)]
Proteins <- omics[, setdiff(names(omics), c(names(S_metab), names(U_metab)))]
omics_layers <- list(
serum = as.matrix(S_metab),
urine = as.matrix(U_metab),
protein = as.matrix(Proteins)
)
#===========================#
## Layer-wise selection ##
#===========================#
selected_M_by_layer <- list()
for (layer in names(omics_layers)) {
layer_data <- omics_layers[[layer]]
layer_M_names <- colnames(layer_data)
## Step 1: Exposure → M | Y
res_M_E_list <- lapply(layer_M_names, function(m) {
lapply(expo_names, function(e) {
f <- as.formula(paste(m, "~", e, "+ hs_zbmi_who +", paste(cova_names, collapse = "+")))
coefs <- summary(lm(f, data = dat))$coefficients
data.frame(M = m, E = e, est = coefs[2, 1], p = coefs[2, 4])
})
})
res_M_E <- do.call(rbind, unlist(res_M_E_list, recursive = FALSE))
res_M_E$p.adj <- p.adjust(res_M_E$p, method = "fdr")
responsive_M <- unique(res_M_E$M[res_M_E$p.adj < 0.1])
## Step 2: M → Y (on all features in the layer)
res_M_Y_list <- lapply(layer_M_names, function(m) {
f <- as.formula(paste("hs_zbmi_who ~", m, "+", paste(cova_names, collapse = "+")))
coefs <- summary(lm(f, data = dat))$coefficients
data.frame(M = m, p = coefs[2, 4])
})
res_M_Y <- do.call(rbind, res_M_Y_list)
res_M_Y$p_fdr <- p.adjust(res_M_Y$p, method = "fdr")
## Prioritize responsive M, fill to 10
res_M_Y <- res_M_Y %>%
arrange(p_fdr) %>%
mutate(priority = ifelse(M %in% responsive_M, 1, 2)) %>%
arrange(priority, p_fdr)
top10 <- head(res_M_Y$M, 10)
selected_M_by_layer[[layer]] <- top10
}
# Final selected omics features across all layers
selected_M <- unlist(selected_M_by_layer)
cat("Total selected features:", length(selected_M), "\n") # Should be 30
## Total selected features: 30
The preliminary screening selected 30 omics out of 257 omcis variables.
After preliminary screening, 101 proteins
or metabolites are selected. The number is still moderately large and we
hope to further decrease the number of omics variables to facilitate
interpretation of the LUCID model. LUCID comes with an integrated
variable selection based on \(L_1\)
penalties. Next, we show how to tune penalty terms for LUCID and conduct
integrated variable selection.
#===================================#
## 3. conduct variable selection ##
#===================================#
# step 1: tune the model parameters
#=============================================================
# this chunk of codes is time-consuming, you can try it later;
# we use the results of this tuning process directly
#=============================================================
# set.seed(123)
# tune_lucid <- lucid(G = exposure,
# Z = omics_selected,
# Y = outcome,
# CoY = covariate,
# Rho_G = c(0, 0.05, 0.1),
# Rho_Z_Mu = c(0.05, 0.1, 0.15),
# Rho_Z_Cov = c(0.05, 0.1, 0.15),
# K = c(2:4),
# lucid_model = "early",
# init_omic.data.model = NULL)
# # check the optimal tuning parameters
# tune_lucid$tune_list
# step 2: integrated variable selections
#==========================================================================
# optimal model: K = 4, Rho_Z_Cov = 0.05, Rho_Z_Mu = 0.05, Rho_G = 0.0
#==========================================================================
set.seed(123)
# Fit the LUCID model with early integration, using selected exposures and omics features
fit_try1 <- estimate_lucid(
G = exposure, # G: matrix of selected exposures (e.g., environmental variables)
Z = omics[, selected_M], # Z: matrix of selected omics features (e.g., metabolites)
Y = outcome, # Y: outcome variable (e.g., BMI z-score)
CoY = covariate, # CoY: covariates to adjust for in the Y model (e.g., age, sex, etc.)
useY = FALSE, # Whether to use Y during model fitting (TRUE = supervised clustering)
K = 4, # Number of latent clusters to estimate
family = "normal", # Distribution family for outcome Y ("normal" for continuous outcomes)
Rho_Z_Cov = 0.1, # Penalty parameter for the covariance matrix of Z (encourages sparsity)
Rho_Z_Mu = 1, # Penalty parameter for the cluster-specific mean of Z (for feature selection)
Rho_G = 0, # Penalty parameter for the effect of G on cluster membership (0 = no penalty)
lucid_model = "early", # Integration strategy: "early" = joint modeling of G, Z, and Y
init_omic.data.model = NULL, # Optional: specify mclust model for Z; NULL lets mclust choose the best model,
)
#check how many omics are selected out of 30
sum(fit_try1$select$selectZ)
#88 were selected
# refit the model with the selected features, we don't need penalty for this
set.seed(123)
fit_try2 = estimate_lucid(G = exposure,
Z = omics[, selected_M][, fit_try1$select$selectZ],
Y = outcome,
CoY = covariate,
useY = FALSE,
K = 4,
family = "normal",
lucid_model = "early",
init_omic.data.model = fit_try1$init_omic.data.model)
summary summarizes LUCID model and print out a table
explaining associations among each components of LUCID. This summary
table is very useful when a few number of environmental/genetic
exposures and omic variables are included. Otherwise, it’s better to
interpret LUCID model through a Sankey diagram, as we’ll show later.
summary(fit_try1)
## ----------Summary of the LUCID Early Integration model----------
##
## K = 4 , log likelihood = -42645.59 , BIC = 112043.1
##
## (1) Y (continuous outcome): effect size of Y for each latent cluster (and effect of covariates if included)
## Gamma
## cluster1 0.000000000
## cluster2 -0.692657964
## cluster3 -0.288355028
## cluster4 1.090351914
## h_mbmi_None 0.047112347
## e3_sex_Nonemale 0.231709786
## h_age_None 0.006758062
## h_cohort2 -0.523571477
## h_cohort3 -0.092012729
## h_cohort4 0.245864981
## h_cohort5 0.153513780
## h_cohort6 0.355271991
## h_edumc_None2 0.042926879
## h_edumc_None3 0.061160477
##
## (2) Z: mean of omics data for each latent cluster
## mu_cluster1 mu_cluster2 mu_cluster3 mu_cluster4
## serum_metab_95 0.00000000 -0.17215825 -0.286693679 0.42441303
## serum_metab_161 0.00000000 -0.17400752 0.017736264 0.35992051
## serum_metab_59 0.00000000 -0.02911857 -0.468116132 0.17055350
## serum_metab_85 0.00000000 -0.08925315 -0.478412885 0.29852582
## serum_metab_79 0.00000000 -0.03513718 -0.334050061 0.14758341
## serum_metab_102 0.00000000 -0.10441161 -0.147147154 0.25532372
## serum_metab_149 0.00000000 -0.07755897 0.181420559 0.11058572
## serum_metab_175 0.00000000 -0.04583465 -0.428024892 0.19690549
## serum_metab_103 0.00000000 0.00000000 -0.083816123 0.02037608
## serum_metab_119 0.00000000 0.03989559 -0.103377162 -0.04205404
## urine_metab_10 0.00000000 -0.03566755 -0.551701052 0.20214352
## urine_metab_6 0.03301145 -0.17996757 -0.479637434 0.48750798
## urine_metab_4 0.00000000 -0.09057700 -0.248641231 0.26406628
## urine_metab_13 0.00000000 0.11494312 -0.243420266 -0.17586179
## urine_metab_27 0.00000000 0.09737545 0.298586011 -0.28088029
## urine_metab_5 0.00000000 -0.17646288 0.657768106 0.20916555
## urine_metab_9 0.00000000 -0.02099490 0.150843619 0.00771194
## urine_metab_1 0.00000000 -0.06072911 0.213301717 0.06988198
## urine_metab_25 0.00000000 0.16541754 -0.036096912 -0.33258456
## urine_metab_31 0.00000000 0.07809542 -0.158557016 -0.13257482
## IL1beta 0.00000000 -0.49278647 -0.250859735 1.06690972
## IL6 0.00000000 -0.49580727 0.102448195 0.98939427
## Leptin 0.00000000 -0.32487488 -0.194574150 0.71602721
## IL10 0.00000000 0.11090282 -0.098998745 -0.20929624
## IP10 0.00000000 0.00000000 0.298630185 -0.07396379
## PAI1 0.00000000 0.03491634 0.015160507 -0.08682869
## INSULIN 0.00000000 -0.22220002 -0.226551099 0.50524929
## CRP 0.00000000 -0.21404398 0.004530880 0.43596969
## BAFF 0.00000000 -0.14754302 0.003426692 0.29424738
## HGF 0.00000000 -0.17066722 0.218637472 0.29904748
##
## (3) E: odds ratio of being assigned to each latent cluster for each exposure
## beta OR
## hs_dde_cadj_Log2.cluster2 1.8725909 6.505129e+00
## hs_dde_madj_Log2.cluster2 13.5935965 8.009842e+05
## hs_ddt_cadj_Log2.cluster2 -0.6103345 5.431692e-01
## hs_ddt_madj_Log2.cluster2 5.1910559 1.796582e+02
## hs_hcb_cadj_Log2.cluster2 -3.6801939 2.521808e-02
## hs_hcb_madj_Log2.cluster2 2.7955499 1.637163e+01
## hs_pcb118_cadj_Log2.cluster2 5.6700425 2.900469e+02
## hs_pcb118_madj_Log2.cluster2 -32.8708219 5.301313e-15
## hs_pcb138_cadj_Log2.cluster2 -0.3662958 6.932977e-01
## hs_pcb138_madj_Log2.cluster2 15.1025473 3.622038e+06
## hs_pcb153_cadj_Log2.cluster2 44.8250988 2.932872e+19
## hs_pcb153_madj_Log2.cluster2 -2.0265771 1.317858e-01
## hs_pcb170_cadj_Log2.cluster2 6.6531501 7.752225e+02
## hs_pcb170_madj_Log2.cluster2 20.1741172 5.774412e+08
## hs_pcb180_cadj_Log2.cluster2 7.5071144 1.820951e+03
## hs_pcb180_madj_Log2.cluster2 16.9756217 2.357322e+07
## hs_sumPCBs5_cadj_Log2.cluster2 -15.9016284 1.241682e-07
## hs_sumPCBs5_madj_Log2.cluster2 -8.6729947 1.711458e-04
## hs_dde_cadj_Log2.cluster3 1.7556918 5.787450e+00
## hs_dde_madj_Log2.cluster3 13.7352136 9.228423e+05
## hs_ddt_cadj_Log2.cluster3 -0.6546072 5.196462e-01
## hs_ddt_madj_Log2.cluster3 5.1713289 1.761488e+02
## hs_hcb_cadj_Log2.cluster3 -3.5838240 2.776931e-02
## hs_hcb_madj_Log2.cluster3 3.2466113 2.570309e+01
## hs_pcb118_cadj_Log2.cluster3 6.2212723 5.033432e+02
## hs_pcb118_madj_Log2.cluster3 -32.7298465 6.103912e-15
## hs_pcb138_cadj_Log2.cluster3 -0.1693120 8.442455e-01
## hs_pcb138_madj_Log2.cluster3 15.0201013 3.335394e+06
## hs_pcb153_cadj_Log2.cluster3 44.1175379 1.445451e+19
## hs_pcb153_madj_Log2.cluster3 -2.3810098 9.245717e-02
## hs_pcb170_cadj_Log2.cluster3 6.5885999 7.267626e+02
## hs_pcb170_madj_Log2.cluster3 19.7234115 3.679330e+08
## hs_pcb180_cadj_Log2.cluster3 7.6231962 2.045088e+03
## hs_pcb180_madj_Log2.cluster3 17.2171659 3.001374e+07
## hs_sumPCBs5_cadj_Log2.cluster3 -16.0336818 1.088079e-07
## hs_sumPCBs5_madj_Log2.cluster3 -8.3034871 2.476517e-04
## hs_dde_cadj_Log2.cluster4 1.7711077 5.877360e+00
## hs_dde_madj_Log2.cluster4 13.7252267 9.136718e+05
## hs_ddt_cadj_Log2.cluster4 -0.6148026 5.407476e-01
## hs_ddt_madj_Log2.cluster4 5.3517591 2.109791e+02
## hs_hcb_cadj_Log2.cluster4 -4.9394434 7.158582e-03
## hs_hcb_madj_Log2.cluster4 3.4996845 3.310501e+01
## hs_pcb118_cadj_Log2.cluster4 6.2967587 5.428096e+02
## hs_pcb118_madj_Log2.cluster4 -32.5629291 7.212728e-15
## hs_pcb138_cadj_Log2.cluster4 0.4445751 1.559827e+00
## hs_pcb138_madj_Log2.cluster4 15.0617613 3.477281e+06
## hs_pcb153_cadj_Log2.cluster4 44.1034375 1.425212e+19
## hs_pcb153_madj_Log2.cluster4 -1.7224810 1.786224e-01
## hs_pcb170_cadj_Log2.cluster4 6.5351135 6.889120e+02
## hs_pcb170_madj_Log2.cluster4 19.6213705 3.322408e+08
## hs_pcb180_cadj_Log2.cluster4 7.4160129 1.662392e+03
## hs_pcb180_madj_Log2.cluster4 17.1476956 2.799946e+07
## hs_sumPCBs5_cadj_Log2.cluster4 -16.4007073 7.538125e-08
## hs_sumPCBs5_madj_Log2.cluster4 -8.5630579 1.910342e-04
After fitting the LUCID model with selected features, we can predict
the cluster assignment for each observaton by calling the
predict_lucid function.
# prediction of LUCID model
pred_fit_try2 = predict_lucid(model = fit_try2,
G = exposure,
Z = omics[, selected_M][, fit_try1$select$selectZ],
CoY = covariate,
lucid_model = "early")
# prediction on latent cluster
table(pred_fit_try2$pred.x)
##
## 0 1 2 3
## 320 69 645 118
LUCID uses a Sankey diagram to visualize the complex associations among different components. In the Sankey diagram of LUCID, each node represents a variable in our dataset (for example, exposure to different organochlorines, various omics feature) and each link represents a statistical association. The color of the link indicates the direction of association: by default, light blue represents positive association while dark blue represents negative association. The width of the link indicates the magnitude of association: the wider a link is, the stronger the statisitcal association is.
#===================================#
## 3. visualize the LUCID model ##
#===================================#
# Since there are still a high number of omics features (88) included in the final LUCID model, we can only visualize top 10 omics features based on the absolute values of differnce across the 2 clusters
# 1. use internal plot function
plot(fit_try2)
# 2. personalize color
# user can personalize colors of node and link
plot(fit_try2,
G_color = "red",
X_color = "blue",
Z_color = "green",
Y_color = "black",
pos_link_color = "orange",
neg_link_color = "gray")
#========================================================================#
## 4. Distribution of zBMI score for each cluster predicted by LUCID ##
#========================================================================#
Y_fit_try2 = as.data.frame(cbind(cluster = as.factor(pred_fit_try2$pred.x), dat[, "hs_zbmi_who"]))
Y_fit_try2 = melt(Y_fit_try2, id.vars = "cluster")
ggplot(Y_fit_try2, aes(x = as.factor(cluster),
y = value,
goup = as.factor(cluster))) +
geom_boxplot() +
xlab("cluster") +
ylab("z-bmi")
To interpret the clustering results of LUCID, we can create box plot of BMI z-score for each latent luster. From the boxplot, we observe that children belonging to cluster 4 are at high risk of obesity while the first 3 clsuters tend to have a similar averaged BMI z-score.
#=============================================================#
## 5. Omics profiles for each cluster predicted by LUCID ##
#=============================================================#
M_mean = as.data.frame(fit_try2$res_Mu)
M_mean$cluster = as.factor(1:4)
M_mean_melt = melt(M_mean, id.vars = "cluster")
# add color label for omics layer
M_mean_melt$color_lable = rep("1", nrow(M_mean_melt))
M_mean_melt[grep("serum", M_mean_melt$variable), "color_lable"] = "2"
M_mean_melt[grep("urine", M_mean_melt$variable), "color_lable"] = "3"
ggplot(M_mean_melt, aes(fill = color_lable, y = value, x = variable)) +
geom_bar(position="dodge", stat="identity") +
ggtitle("Omics profiles for 4 latent clusters") +
facet_wrap(~cluster) +
facet_grid(rows = vars(cluster)) +
theme(legend.position="none") +
xlab("") +
theme(text = element_text(size=10),
axis.text.x = element_text(angle = 45, vjust = 1,
hjust = 1))
This figure, the y-axis represents the estimated mean expression level for each metabolite. We could observe distinguished pattern of omic profiles for each subgroup. We want to specifically highlight 3 proteins: insulin, IL1-beta and IL-6. Latent cluster 4, the high risk subgroup, is particularly characterized with elevated levels of IL-1beta and IL-6, two key markers of systemic inflammation in the human body and higher levels of insulin, which is known to closely associate with obesity and underlying metabolic dysfunction.
#================================================================#
## 6. Exposure profiles for each cluster predicted by LUCID ##
#================================================================#
E_mean = as.data.frame(fit_try2$res_Beta[, -1])
E_mean$cluster = as.factor(1:4)
E_mean_melt = melt(E_mean, id.vars = "cluster")
ggplot(E_mean_melt, aes(fill = variable, y = value, x = variable)) +
geom_bar(position="dodge", stat="identity") +
ggtitle("Exposure profiles for 4 latent clusters") +
facet_wrap(~cluster) +
facet_grid(rows = vars(cluster)) +
geom_hline(data = data.frame(yint=0, cluster ="1"), aes(yintercept = yint), linetype = "dashed", color = "red") +
theme(legend.position="none") +
xlab("") +
theme(text = element_text(size=10),
axis.text.x = element_text(angle = 45, vjust = 1,
hjust = 1))
This figure displays the log odds ratios of exposures across the four
latent clusters. Each bar represents the strength and direction of
association between a specific exposure and cluster membership, adjusted
for covariates.
Notably, hs_hcb_madj_Log2 (maternal HCB) shows a strong gradient across clusters, increasing from cluster 1 to cluster 4, suggesting it plays a key role in distinguishing exposure profiles. This trend supports a potential obesogenic effect of maternal HCB, supporting previous HELIX findings (Vrijheid et al., EHP 2020).
Other exposures also contribute to the cluster structure, including: Cord HCB (hs_hcb_cadj_Log2), which decreases from cluster 2 to 4 linking higher early-life HCB exposure to lower BMI z-scores; PCBs 118, 138, and 153 in both maternal and cord plasma; And DDE/DDT, which also show moderate directional shifts across clusters.
These patterns underscore the complex mixture structure of environmental exposures in this cohort, which LUCID early integration model helped to reveal. Rather than being driven by a single compound, the latent clusters reflect combinatorial signatures, with maternal HCB as a particular contributor. This highlights the value of mixture modeling approaches for identifying exposure profiles that may influence downstream health outcomes like BMI.
In the analysis above, we conducted early integration for multi-omics data by concatenating three layers (serum metabolites, urine metabolites and proteins). What will results differ if we conduct late integration (analyzing each omic layer separately then combine the results?) You can pick the selected exposure or any exposure of interest.
dat[, 2:19]dat[, 20:24]dat[, 26]dat[, 27:203]dat[, 204:247]dat[, 248:283]Three DAGs represent how three different types of the LUCID model integrate ge- netic/environmental exposures (G), other multi-omics data (Z), and the phenotype trait (Y). (a) LUCID early integration; (b) LUCID in parallel; (c) LUCID in serial.
In the previous sections, we focus on using the early integration strategy by concatenating each omic layer one by one and inputting a single data matrix into the LUCID model for estimation. However, researchers may be interested in modeling the correlation structure of each omic layer independently to investigate how multi-omics data may act in parallel with an outcome. Recent advancement of the LUCIDus package allows us to do it.
#================================================================#
## 7. Group different omic layers for selected omics features ##
#================================================================#
omics_selected = as.data.frame(omics[,selected_M])
# Extract serum features
S_metab <- omics_selected[, grep("^serum", names(omics_selected), value = TRUE)]
# Extract urine features
U_metab <- omics_selected[, grep("^urine", names(omics_selected), value = TRUE)]
# Extract Protein features
Proteins <- omics_selected[,names(omics_selected)[!grepl("^serum|^urine", names(omics_selected))]]
#Put them in a list
omics_selected_list <- list(as.matrix(S_metab), as.matrix(U_metab), as.matrix(Proteins))
#================================================================#
## 8. Intermediate Integration: LUCID in Parallel ##
#================================================================#
#Here, since we have a small number of features in each layer, we don't use regularity
fit_try3 = estimate_lucid(G = exposure,
Z = omics_selected_list,
Y = outcome,
CoY = covariate,
useY = FALSE,
K = c(2,2,2), #2 latent cluster for each layer
family = "normal",
lucid_model = "parallel")
#get the summary table of the model
#print.sumlucid(summary_lucid(fit_try3))
#add a prefic for proteins features as they don't have a common pattern
fit_try3$var.names$Znames[[3]] <- paste("protein", fit_try3$var.names$Znames[[3]], sep = "_")
#visualize LUCID in parallel
#Sankey Diagram
plot_lucid_in_parallel_plotly(fit_try3,
sankey_colors = sankey_colors_parallel,
text_size = 10,
n_z_ftrs_to_plot = c(10,10,10))
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
## `.name_repair` is omitted as of tibble 2.0.0.
## ℹ Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
#omics profile
plot_omics_profiles(fit_try3, "intermediate",pattern_list = list("serum","urine","protein"), omics_list = list("serum","urine","protein"))
In a more late integration framework, LUCID can also be extended to incorporate multiple latent variables in a serial fashion if researchers believe that given an exposure, multi-omics data act serially through a multistep process towards the outcome. This framework can accommodate the following situations: (1) longitudinal measurements on the same multi-omics data type and (2) biological relationship of multi-omics data. The estimation of latent clusters for each omic layer can be formulated as an unsupervised LUCID early integration or LUCID in parallel sub-model.
#================================================================#
## 9. Late Integration: LUCID in Serial ##
#================================================================#
#Here, since we have a small number of features in each layer, we don't use regularity
fit_try4 = estimate_lucid(G = exposure,
Z = omics_selected_list,
Y = outcome,
CoY = covariate,
useY = FALSE,
K = c(2,2,2), #2 latent cluster for each layer
family = "normal",
lucid_model = "serial")
## ...................................................................................................................................................................................................
#get the summary table of the model
#print.sumlucid(summary_lucid(fit_try4))
#add a prefic for proteins features as they don't have a common pattern
fit_try4$var.names$Znames[[3]] <- paste("protein", fit_try4$var.names$Znames[[3]], sep = "_")
#visualize LUCID in serial
sankey_in_serial(fit_try4,
color_pal_sankey_serial,
text_size = 10)
#omics profile
plot_omics_profiles(fit_try4, "late",pattern_list = list("serum","urine","protein"), omics_list = list("serum","urine","protein"))